Strategies for Cleaning Organizational Emails with an Application to Enron Email Dataset
Authors
Abstract
Archived organizational email datasets are valuable data resources for a range of studies, including spam detection, email classification, Social Network Analysis (SNA), and text mining. Like other forms of raw data, email data can be messy and needs to be cleaned before any analysis is conducted. However, few studies have investigated the cleaning of archived organizational emails. This paper examines the properties of organizational emails and the difficulties faced in the cleaning process. Cleaning strategies are then proposed to solve the identified problems. The strategies are applied to the Enron email dataset.
Contact: Yingjie Zhou, Dept. of Decision Sciences and Engineering Systems, Rensselaer Polytechnic Institute, Troy, NY 12180. Tel: 1-518-276-8457. Fax: 1-518-276-8227. Email: [email protected]
Similar resources
Detecting Unusual and Deceptive Communication in Email
Deception theory suggests that deceptive writing is characterized by reduced frequency of first-person pronouns and exclusive words, and elevated frequency of negative emotion words and action verbs. We apply this model of deception to the Enron email dataset, and then apply singular value decomposition to elicit the correlation structure between emails. This allows us to rank emails by how wel...
Detecting unusual email communication
Deception theory suggests that deceptive writing is characterized by reduced frequency of first-person pronouns and exclusive words, and elevated frequency of negative emotion words and action verbs. We apply this model of deception to the Enron email dataset, and then apply singular value decomposition to elicit the correlation structure between emails. Those emails that have high scores using ...
Learning User Embeddings from Emails
Many important email-related tasks, such as email classification or search, rely heavily on building quality document representations (e.g., bag-of-words or key phrases) to assist matching and understanding. Despite prior success in representing textual messages, creating quality user representations from emails has been overlooked. In this paper, we propose to represent users using embeddings that a...
Email Classification Using Machine Learning Algorithms
Email has become one of the most frequently used forms of communication. Everyone has at least one email account. The inflow of spam messages is a major problem faced by email users. Many spam filtering techniques currently exist, but as these techniques emerged, spammers improved their methods of spamming. An effective spam filtering technique is therefore a timely requirement. In this paper...
Inferring Formal Titles in Organizational Email Archives
In the social network of large groups of people, such as companies and organizations, formal hierarchies with titles and lines of authority are established to define the responsibilities and order of power within that group. Although this information may be readily available for individuals within that group, the context this hierarchy provides in communications is not available to those outsid...